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1  Summary 

The  extraction  of  relevant  information  from  large-scale  data  requires  an  ac¬ 
curate  model  of  the  data  source.  In  practice,  such  a  model  has  to  be  generic 
enough  to  encompass  data  sources  of  different  modality  but  also  mathemati¬ 
cally  tractable  to  allow  a  fast  rate  of  learning.  We  have  selected  exponentially 
embedded  family  (EEF)  of  probability  density  functions  (PDFs)  for  informa¬ 
tion  extraction.  The  EEF  belongs  to  the  exponentially  family  of  PDFs,  and 
therefore  inherits  many  important  properties  to  allow  efficient  information  ex¬ 
traction  and  learning  from  data.  First  of  all,  it  admits  sufficient  statistics  and 
therefore,  provides  the  means  for  selecting  good  models.  It  also  easily  allows 
additional  sensor  statistics  to  be  evaluated  and  possibly  incorporated  into  the 
model.  The  exponential  family  is  a  reproducible  probability  density  function 
family  (a  Gaussian  is  a  special  case),  whose  conjugate  density  is  also  of  the 
exponential  family  type.  For  unsupervised  scenarios  in  which  unknown  param¬ 
eters  need  to  be  estimated,  the  maximization  of  the  likelihood  leads  to  a  convex 
optimization  problem,  and  thus  can  be  easily  implemented.  Furthermore,  the 
model  is  sufficiently  general  to  encompass  any  real-world  situation.  The  em¬ 
bedded  exponential  family  approach  yields  a  means  of  on-line  assessment  of 
performance  since  the  Kullback-Liebler  divergence  is  a  part  of  the  model.  The 
rate  of  learning  is  also  readily  found  since  the  Kullback-Liebler  divergence  can 
be  used  to  ascertain  distances  between  PDFs  for  various  hypothesis  testing 
scenarios. 

We  have  focused  on  the  mathematical  formalism  and  stochastic  machine 
learning  algorithms  for  extraction  of  relevant  information  based  on  the  EEF, 
learning  and  classification  over  large-scale  stream  data,  and  information  fusion 
and  integration.  In  particular,  we  have  proposed  a  PDF  estimation  approach 
based  on  the  EEF,  and  a  measure  for  assessment  of  information  from  sensors. 
We  have  also  taken  advantage  of  the  model  structure  information  for  model 
estimation.  Furthermore,  we  have  proved  a  general  Pythagorean  theorem  for 
the  EEF  and  studied  a  multipath  scenario  for  sensor  selection.  Finally,  we  also 
analyzed  and  developed  a  series  of  machine  learning  techniques  for  effective 
data  learning,  classification,  and  decision  making,  including  adaptive  incre¬ 
mental  learning  from  stream  data,  information  fusion  with  multiple  learning 
models/hypotheses,  machine  learning  with  non-stationary  imbalanced  stream 
data,  kernel  density  estimation  based  on  self-organizing  map(SOM),  among 
others.  These  results  have  been  published  in  peer-reviewed  conferences  and 
journals,  including  IEEE  Transactions  on  Neural  Networks,  IEEE  Transac¬ 
tions  on  Neural  Networks  and  Learning  Systems,  Neurocomputing  (Elsevier), 


i 

Approved  for  public  release;  distribution  unlimited. 


a  book  chapter  with  Wiley-IEEE,  among  others. 


2  EEF  for  Estimation  of  PDF 


Consider  a  problem  where  we  have  two  sensors  (extension  is  straightforward 
for  multiple  sensors)  as  shown  in  Figure  1.  We  use  the  EEF  to  assess  the 
significance  of  the  information  contributed  by  T2  for  a  decision. 


PT1,T2(ti-,h','Ho) 


Figure  1:  Signal  Detection  Problem  Overview 


1.  Assume  PDF  for  background  ('Ho)  is  known  -  in  practice  can  measure  this 

before  event  of  interest  takes  place. 

2.  Construct  best  approximation  to  PDF  using  PDF  under  Ho  and  Ti  and  T2 

data  under  Hi . 

3.  Produces  EEF  PDF  approximation 

pr/(t)  =  exp  [?7ifi  +  r)ih  -  K( 771,772)  +lnpT(t;H0)] 

We  have  to  choose  the  embedding  parameters  771,772.  Note  that  if  we  decide 
772  =  0,  this  is  equivalent  to  ignoring  sensor  2  output. 

To  minimize  the  Kullback-Liebler  distance  ( determines  prob.  of  detection 
performance),  we  choose  embedding  parameters  so  that  approximate  PDF  has 
moments  matched  to  true  PDF  under  Hi,  which  are  presumed  known,  i.e., 

Et[Ti ]  =  Ai  Et[T2]  =  A2  (from  data  under  Hi) 
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This  is  equivalent  to  Gram-Schmidt  orthogonalization  for  Gaussian  PDFs  (see 
Figure  2). 


Figure  2:  Best  Approximation 
For  one  sensor  we  construct 


Pr){  Prii,m=o{^l )  ^2) 
and  for  two  sensors  we  construct 

Pr y  =  Pri{,T&{ti,h) 

Information  content  of  T2(x)  is 

D(Pr}t,Tfi(tii  ^2)1  \Ptj* .7)2=0 (f  1  >  G))  =  reduction  in  distance  to  true  PDF 
where  D(pi\\p2)  is  Kullback-Liebler  distance 

DiPiWPi)  =  /PlWln^ydx 

3  Assessment  of  Information  from  Sensors  for 
the  EEF 

We  have  proposed  a  measure  of  information  increase  for  the  exponentially  em¬ 
bedded  family,  when  new  sensors  are  added.  This  measure  is  always  greater 
than  or  equal  to  zero,  which  implies  that  by  adding  new  sensors,  we  could 
always  obtain  some  information  (or  at  least  keep  the  same  information  if  new 
sensors  are  redundant).  We  have  proved  that  the  information  provided  by 
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independent  sensors  is  additive.  Based  on  this  measure,  we  can  decide  which 
sensor  provides  the  most  information  and  select  the  best  combination  of  sen¬ 
sors.  We  can  also  find  redundant  sensors  if  this  measure  becomes  zero.  There¬ 
fore,  we  can  perform  sensor  selection  and  sensor  reduction  using  this  measure 
of  information  increase. 

Assume  that  we  have  only  one  sensor  Tx(x),  then  the  EEF  is  constructed 
as 

Prft(x)  =  exp  [rfi Tx(x)  -  tfx(»h)  +  lnp0(x)]  (1) 

j\ 

where  K i(i71)  =  In  (/  exp,  [T/fTx(xj]'p0(x)dx)  =  lnE0  (exp  [rjfT^x)]).  Since 
r]l  are  the  unknown  parameters,  we  find  the  MLE  of  i)l  by  maximizing  prjl  (x) 
or  equivalently  maximizing  7jfTx(x)  —  ATx(t7x).  Taking  the  derivative  and 
setting  it  to  zero,  the  MLE  should  satisfy 


Ti(x) 


dKjJjh) 

dVi 


Vi 


(2) 


The  KL  divergence  between  the  true  PDF  pt(x)  and  the  EEF  /^(x)  is 

DiPtWPr],)  =  [  a(x)  In  ~^~-dx 

=  J Pf(x)  [lnpt(x)  -  lnpo(x)  -  (^fTi(x)  -  Kx(tjx))]  dx 

=  D(pt\\po)  -  [rfiEt  (Tx(x))  -  ATx(tjx)]  (3) 

Let  r/(1)+  be  the  t]1  that  minimizes  D(pt\\prj  )  or  equivalently  maximizes  rjJ Et  (Tx(x))- 
AA(r/1).  Then  tj^*  should  satisfy 


Et(  Ti(x)) 


dK  i(Vi) 


dV\ 


(4) 


In  the  case  when  we  have  L  IID  unobserved  samples  xl5  x2, . . .  ,xl  and  we 
only  observe  L  IID  outputs  Tx(xx),  Tx(x2), . . . ,  Tx(xL)  from  the  same  sensor, 
(2)  can  be  extended  as 


7  !>(*■)  = 


i= 1 


dKMl 

drji 


Since  {  £f=1  Tx(xt)  4  Et  (Tx(x))  as  L  — »  oo,  it  can  be  shown  that 


-  P,  (i)* 


(5) 

(6) 


4 

Approved  for  public  release;  distribution  unlimited. 


and 


D(Pt\\Pff)  ^  D(pt\\p, nM.)  =  D{pt\\p0 )  -  max  [rflEt  (T^x))  -  #1(771)]  (7) 

n  </! 


Say  we  add  another  sensor  T2(x),  the  EEF  becomes 

Pt?i,T72(x)  =  exP  [^[ti(x)  +  77^T2(x)  -  K(rh,ri2)  +  \npo(x)]  (8) 

where  K(rjl,r)2)  =  In  E0  (exp  [t7^’T1(x)  +  ^^(x)]).  Note  that  when  t)2  =  0, 
the  EEF  in  (10)  reduces  to  the  EEF  in  (1)  because 

AT(t7i,0)  =  lnE0  (exp  [r/^T^x)])  =  Kx{ r)x)  (9) 


The  MLE  of  T7X  and  ri2  are  obtained  by  solving 


Ti(x)  = 


T2(x)  - 


dKjy uJh) 

dr)  i 

dK{Vi,rh) 

dr)  2 


flvf)2 

f)lf)2 


(10) 


The  KL  divergence  between  the  true  PDF  pt(x)and  the  EEF  Prj^i 72(x)  is 


D(Pt\\Pr)vr)2)  =  D(pt\\po)-[T)jEt(T1(x))  +r7^£t(T2(x))  -K(r)uri2)]  (11) 


Let  r)i  and  r)2  be  the  T)l  and  t)2  that  minimize  D(pt\\pr)1,T)2)  or  equivalently 

maximize  r)jEt  (Ti(x))  +  r)2  Et  (T2{x.))  —  K(t)1,  t)2).  Then  j )^]*  and  ri'.p* 
satisfy 


Et  (Tt(x)) 
Et(  T2(x)) 


dK(rh,r)2) 


dr)  i 

dK(r)x.r)2) 

dr)2 


r)™\r)™' 


T7)2)*M2)* 


(12) 


Similarly,  it  can  be  shown  that 


fh 

r)2  ^  r)2  ]* 
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and 


D(pt\ \Pi)1,i]2)  D(pt\ Ip^w*  jjW*) 

=  D(pt\\po)  -  max  [rjjEt  (Ti(x))  +  f]\Et  (T2(x))  -  K^,^)] 

*1 1**1 2 

Since 

max  \rjjEt  (Ti(x))  +  r^Et  (T2(x))  -  Rfa,  r}2)] 

*  1  ’  *#2 

>max  rjiEt  (Ti(x))  +  r^Et  (T2(x))  -  Kfa,  r?2)|??2=0 
=  max  [77^  (Ti(x))  -  ^i(r?i)]  (13) 

'/l 

where  the  last  equality  is  from  (11).  Therefore,  from  (9)  and  (14),  we  conclude 
that  KL  divergence  between  the  true  PDF  and  the  EEF  will  decrease  by  adding 
another  sensor.  The  difference 

DiPtWPjjm*)  ~  DiptWPjjV.  ^2).) 

=  max  [rfiEt  (Tx(x))  +  r\\Et  (T2(x))  -  K(tj1,tj2)]  -  max  [r{[Et  (Ti(x))  -  K^rj^] 

'll'' 12  'll 

(14) 

can  be  considered  as  a  measure  of  the  information  increase  by  adding  another 
sensor. 

Also  note  that  if  Ti(x)  and  T2(x)  are  independent  under  Ho,  then 
i,*?2)  =  !n£o  (exp  [r?[T,(x)  +  *#T2(x)]) 

=  In  [E0  (exp  [^fTi(x)])  Eo  (exp  [i]2T2(x)])] 

=  In  E0  (exp  [^fTi(x)] )  +  ln£0  (exp  [ij2T2(x)]) 

=  ATi(t7i)  +  AT2(772)  (15) 

where  K2(tj2)  is  the  normalizing  factor  or  cumulant  generating  function  of  the 
EEF 

P772(x)  =  exp  [»J2  T2(x)  -  K2(tj2)  +  lnpo(x)]  (16) 

constructed  using  the  sensor  T2(x)  only.  In  this  case, 

max  [rfiEt  (Ti(x))  +  rgEt  (T2(x))  -  K(r)l,'q2)\ 

=  max  [rfiEt  (Tj(x))  +  r)\Et  (T2(x))  -  K^)  -  AT2(t72)] 

rti'rh 

=  max  [rfiEt  (Ti(x))  -  #i(»h)]  +  max  [rfiEt  (T2(x))  -  A'2(^2)] 

'/1  *h 

(17) 
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and  (17)  becomes 


DiPtWPfjW*)  ~  D(pt\\priw.  ^(2).) 

=  max  [rfiEt  (T2(x))  -  K2{r]2)]  (18) 

'#2 

4  Model  Estimation  via  Model  Structure  De¬ 
termination 

In  model  estimation,  we  often  face  problems  with  unknown  parameters  in 
the  candidate  models.  This  paper  proposes  the  model  structure  determina¬ 
tion  (MSD)  for  model  estimation  with  unknown  parameters.  We  start  with 
the  problem  of  model  order  selection,  and  decompose  the  probability  density 
function  (PDF)  into  the  information  provided  by  the  data  about  the  model 
parameters  and  that  of  the  model  structure.  The  factor  that  depends  on  the 
model  parameters  is  approximated  using  a  minimax  procedure,  and  the  MSD 
depends  on  the  model  structure  only.  It  is  shown  that  the  MSD  is  equivalent 
to  the  exponentially  embedded  family  (EEF)  for  model  order  selection  under 
some  conditions.  Finally,  we  apply  the  MSD  to  a  classification  problem  where 
we  have  partial  knowledge  about  the  parameters,  and  simulation  results  show 
that  it  outperforms  the  pseudo-maximum  likelihood  (pseudo-ML)  rule. 

4.1  Model  Order  Structure 

We  start  by  considering  the  problem  of  model  order  selection,  where  we  have 
a  set  of  candidate  models  {Adi,  M2,  ■  ■  ■  ,Mm}-  Each  model  has  a  set  of  un¬ 
known  parameters  which  we  denote  as  6l  and  which  has  dimension  pt  x  1 .  Based 
on  an  observation  of  x  =  [x[0]  x[l] . . .  x[N  —  1]]T,  we  wish  to  choose  a  model 
without  knowledge  of  for  each  model.  The  PDF  is  given  as  px(x;  0*,  AT). 
Furthermore,  we  assume  that  a  minimal  sufficient  statistic  exists  for  0,  and  is 
given  by  T*(x).  Dispensing  with  the  particular  model  notation  for  the  time 
being,  note  that  by  the  Neyman-Fisher  factorization  theorem  the  PDF  can  be 
written  as 

Px(x;0)  =  5(t(x),0)h(x)  (19) 

where  t(x)  is  the  p  x  1  sufficient  statistic  for  0 ,  which  is  also  p  x  1.  We  next 
decompose  the  PDF  into  the  information  provided  by  the  data  about  the  model 


i 

Approved  for  public  release;  distribution  unlimited. 


parameters  and  that  of  the  model  structure.  To  do  so  we  have  from  (19)  that 


0)  =  g(t(x),0) 
Px(x;0o )  g(t(x),60) 


where  0O  is  arbitrary.  Next  assume  that  g(t{x),0)  can  be  normalized  to  inte¬ 
grate  to  one  or  if  t  =  t(x),  then 


=  c  <  oo 


for  c  a  constant,  which  does  not  depend  on  6.  As  a  result,  the  PDF  of  t(x)  is 


Pt(  t(x);0)  =  g(t(x),0)/c 


and  we  can  write  (20)  as 


px{x]0)  =px(x;0  0) 


/>/  f?) 

Pt(  t(xj;6>o) 


and  finally  this  becomes 


Px(x-,0) 


Pt( t(x);  0\ 
model  parameters 


Px{x;0  q) 

Pr(t(x);0o) 

' - - - " 

model  structure 


(21) 


The  first  factor  depends  only  on  the  model  parameters  and  since  Oo  is  arbi¬ 
trary  the  second  factor  depends  only  on  the  model  structure  (it  is  functionally 
independent  of  0),  i.e.,  the  dimensionality  of  the  model.  Note  that  in  the  con¬ 
ditional  model  estimator  (CME),  only  the  second  factor  is  used  by  omitting 
the  first  factor.  Here,  we  use  both  factors.  To  illustrate  the  ideas  we  will  use 
the  linear  model,  which  is  important  in  practice. 

For  the  linear  model  x  =  H 0  +  w,  where  H  is  N  x  p,  w  ~  A^(0,<r2I) 
with  a1  known,  it  is  well  known  that  the  minimal  sufficient  statistic  is  t(x)  = 
(HtH)-1Htx  and  furthermore  that  t(x)  ~  N(Q,  cr2(HrH)_1).  Thus, 


Pr(t(x);0) 


1 


(27r)p/2|<72(HTH)"1|1/2 


exp 


-bt(x)  -  e)T?ri(t(x)  -  8) 


and  by  writing 


(x  —  H0)t(x  —  H0)  =  (x  —  Ht(x))T(x  —  Ht(x))  +  (t(x)  —  0)T(HTH)(t(x)  —  0) 
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we  have  that 


Px(x;0)  = 


1 

(0  exp 

( 2tt<jz )  2 


,0  2,g  exP 

(zira2)  2 


1 


2ct2 

1 

’  2  a2 


■  exp 


(x  -  H0)T(x  -  H0) 

(x  —  Ht(x))T(x  —  Ht(x)) 
,(HTH) 


(t(x)  -  0)T— --"-( t(x)  -  G) 


As  a  result  the  model  structure  component  of  the  PDF  is 
Px(x;0o) 


exp  [  2^2  (x  -  Ht(x))T(x  -  Ht(x))] 


Pr(t(x);0oj 


(2jr)p/2|<T2(H7'H)-1|1/2 


1 


(27rrr2)(Af-P)/2 


exp 


1 


2a2 X  P//X  jHrH|1/2  ' 
Jacobian 


1 


(22) 


Note  that  P#  =  I— H(HTH)“1HT  (the  orthogonal  projector  operator)  annihi¬ 
lates  the  signal  H0  and  hence  xTP#x  does  not  depend  on  0.  This  again  shows 
that  the  model  structure  component  is  functionally  independent  of  0.  When 
x  is  transformed  to  t  and  u,  then  the  Jacobian  is  needed.  More  specifically, 
let  the  N  x  (N  —  p)  matrix  B  =  [bi  b2  . . .  bN-p]  consist  of  columns  that  are 
orthonormal  and  span  the  orthogonal  subspace  to  the  columns  of  H.  Hence, 
we  have  that 

x  =  Ht  +  Bu  (23) 

where  u  is  (N  —p)  x  1.  Also,  we  note  that  BTB  =  In-p  and  BTH  =  0.  Hence, 
the  transformation  from  x  to  (t,u)  is  from  (23) 


The  Jacobian  is 


AtA|1//2  = 


HT 

BT 


1/2 


[HB] 


HTH  0 

0  BTB  =  Ijv-p 


hth|1/2 


which  cancels  the  Jacobian  term  in  (22).  Also, 

tl\T  ± 

H 

U 


xTP^x  =  I  [HB] 


[HBl 


t 

u 
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and  noting  that  PffH  =  0  and  Pf<B  =  B,  we  have  that 


Also,  since  t  is  a  complete  sufficient  statistic  and  u  does  not  depend  on  9, 
and  hence  is  ancillary,  by  Basu’s  theorem  t  and  u  are  independent.  Hence,  we 
have  finally  from  (21)  and  (22)  after  transformation  to  t  and  u  that 

t  -  A/'(0,<72(HtH)-1) 

u  ~  M(0,a2lN-p)  (24) 

and  are  independent. 

For  inference  on  9  we  would  generally  discard  u  and  make  decisions  based 
on  the  sufficient  statistic.  This  assumes  that  we  wish  to  test  within  a  given 
PDF  family.  However,  when  choosing  between  models,  i.e.,  between  PDF 
families,  it  is  just  the  opposite.  The  sufficient  statistic,  indicating  information 
about  the  model  parameters,  is  of  no  use  (when  they  are  unknown).  It  is 
actually  u  that  is  important  in  choosing  between  models.  The  main  difficulty 
in  just  discarding  t  is  that  the  remaining  data  vector  u  changes  in  dimension 
as  the  model  changes  and  so  any  decision  is  based  on  various  dimensionality 
data  sets.  This  will  lead  to  unacceptable  results.  In  other  words,  we  cannot 
simply  compare  pcri(Ui|.Mt),s  for  model  estimation  since  Uj’s  may  have  different 
dimensions.  We  must  maintain  the  dimensionality  of  the  data  by  replacing  the 
PDF  of  t  by  one  that  is  known.  We  next  show  how  to  do  this. 

We  keep  the  part  of  the  PDF  of  (t,  u)  that  is  associated  with  u  and  attempt 
to  replace  the  part  associated  with  t  since  without  knowledge  of  9.  it  is  of  no 
use  for  model  estimation  as  discussed  at  the  end  of  the  previous  section.  Hence, 
the  problem  now  reduces  to  finding  a  suitable  approximation  or  estimate  of 
Pt{ t;0)  =  M{9,  cr2(HrH)-1)  =  J\f(9,Ct).  In  order  to  arrive  at  a  meaningful 
solution  we  will  need  to  constrain  the  space  of  9.  Otherwise,  our  approach 
is  not  viable,  as  will  be  shown  later.  Hence,  we  assume  that  9TCf19  — 

—  <  f2 ,  which  is  the  interior  and  boundary  of  an  ellipsoid  in  RN ,  so  that 

the  possible  values  of  9  lie  within  O  =  {6  :  -  l^2U&  <  £2}.  In  many  cases  of 
model  estimation  we  may  already  have  some  idea  as  to  the  possible  limits  of  the 
parameters.  For  example,  0  can  be  considered  as  a  weighted  energy  constraint 
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which  is  often  encountered  in  practice  such  as  communications.  Hence,  such 
a  constraint  may  also  make  practical  sense.  With  this  restriction  we  let  the 
approximating  PDF  be  W(/i,  C)  and  choose  //  and  C  to  best  approximate 
the  original  PDF  M(9,  <t2(HtH)-1).  To  do  so  we  adopt  the  Kullback-Liebler 
(KL)  measure  between  PDFs  and  use  a  minimax  approach.  We  will  see  that  it 
admits  a  simple  and  very  intuitive  solution  and  one  that  under  some  conditions 
coincides  with  the  EEF  for  model  order  selection.  The  KL  measure  (also  called 
the  divergence)  is  defined  as 

D(pr(t;0)||pT(t;/i,C))  =  [ pT( t;fl)ln  dt.  (25) 

J  Pt{  t;/x,C) 

For  the  case  of  two  multivariate  Gaussian  PDFs  with  t  ~  Ci)  and 

t  ~  A/’(/x2,  C2)  it  is  shown  to  be 

Dmih,  COlW^.Cy)  =  ilnj^j+ttr[C.(Cj1-Cr1)]+i((ll-^)3'C2-V2-/ll) 

(28) 

The  next  theorem  shows  that  the  minimax  approximating  PDF  to  Af(0,  ct2(HtH)  x) 
is  given  as  W( 0,  (1  +  ^2/p)cr2(HTH)“1  ).  It  is  interesting  to  note  that  the  same 
result  is  obtained  in  the  Bayesian  case  if  we  were  to  assume  Zellner’s  g-pnor 
for  0  .  In  this  case  the  prior  PDF  for  9  is  A7(0,  (£2/p)ct2(HtH)-1).  This 
is  undoubtedly  due  to  the  close  connection  between  minimax  and  Bayesian 
decision  theories. 

Theorem  1  (Minimax  approximation  to  PDF).  Assume  apx  1  random  vector 
T  is  distributed  according  to  pT  =  ff(0,  Ct)  where  the  mean  9  is  unknown  but 
lies  within  a  constraint  set  ©  =  {9  :  9TCfl9  <  £2},  and  the  covariance 
matrix  C(  is  known.  We  approximate  the  PDF  of  T  in  a  minimax  sense  using 
Pt  —  A/"(/x,  C)  for  some  /i,  C,  which  is  equivalent  to  solving  the  problem 

min  max  £>(jV(0,  Ct)||W(/i,  C))  (27) 

A4-0  0ee 

The  solution  of  (27)  is  pi*  =  0,  C*  =  ^1  +  C(. 

Proof.  See  Appendix  A. 

Having  obtained  a  suitable  replacement  for  the  PDF  of  the  sufficient  statis¬ 
tic  of  9,  which  does  not  depend  on  0,  we  can  finally  obtain  the  approximating 
PDF  for  the  original  data  x.  From  (24)  we  now  have 

t  ~  Afio^i+e/py^ar1) 

u  ~  Af( 0,  a2 IN-p) 
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and  noting  that  P/fH  =  0  and  PitB  =  B,  we  have  that 


X^PffX 


'  t  ' 

T 

HT  ' 

'  t  ‘ 

[OB] 

u 

BT 

u 

0  0 

0  I;v_p 


Also,  since  t  is  a  complete  sufficient  statistic  and  u  does  not  depend  on  6, 
and  hence  is  ancillary,  by  Basu’s  theorem  t  and  u  are  independent.  Hence,  we 
have  finally  from  (21)  and  (22)  after  transformation  to  t  and  u  that 

t  ~  A/"(0,<72(HtH)_1) 

u  ~  AT(0,  cr2Ijv-p)  (24) 


and  are  independent. 

For  inference  on  6  we  would  generally  discard  u  and  make  decisions  based 
on  the  sufficient  statistic.  This  assumes  that  we  wish  to  test  within  a  given 
PDF  family.  However,  when  choosing  between  models,  i.e. ,  between  PDF 
families,  it  is  just  the  opposite.  The  sufficient  statistic,  indicating  information 
about  the  model  parameters,  is  of  no  use  (when  they  are  unknown).  It  is 
actually  u  that  is  important  in  choosing  between  models.  The  main  difficulty 
in  just  discarding  t  is  that  the  remaining  data  vector  u  changes  in  dimension 
as  the  model  changes  and  so  any  decision  is  based  on  various  dimensionality 
data  sets.  This  will  lead  to  unacceptable  results.  In  other  words,  we  cannot 
simply  compare  p[/i(ui|A4i)’s  for  model  estimation  since  u^’s  may  have  different 
dimensions.  We  must  maintain  the  dimensionality  of  the  data  by  replacing  the 
PDF  of  t  by  one  that  is  known.  We  next  show  how  to  do  this. 

We  keep  the  part  of  the  PDF  of  (t,  u)  that  is  associated  with  u  and  attempt 
to  replace  the  part  associated  with  t  since  without  knowledge  of  0 ,  it  is  of  no 
use  for  model  estimation  as  discussed  at  the  end  of  the  previous  section.  Hence, 
the  problem  now  reduces  to  finding  a  suitable  approximation  or  estimate  of 
pr(t;0)  =  Af(0,  <72(HrH)_1)  =  Ct).  In  order  to  arrive  at  a  meaningful 
solution  we  will  need  to  constrain  the  space  of  9.  Otherwise,  our  approach 
is  not  viable,  as  will  be  shown  later.  Hence,  we  assume  that  9TCf16  = 

®  <  £25  which  is  the  interior  and  boundary  of  an  ellipsoid  in  RN ,  so  that 

the  possible  values  of  6  lie  within  O  =  {9  :  -  ”2 <  £2}.  In  many  cases  of 
model  estimation  we  may  already  have  some  idea  as  to  the  possible  limits  of  the 
parameters.  For  example,  O  can  be  considered  as  a  weighted  energy  constraint 
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which  is  often  encountered  in  practice  such  as  communications.  Hence,  such 
a  constraint  may  also  make  practical  sense.  With  this  restriction  we  let  the 
approximating  PDF  be  J\f(pi,  C)  and  choose  pi  and  C  to  best  approximate 
the  original  PDF  J\f(9,  cr2(HTH)_1).  To  do  so  we  adopt  the  Kullback-Liebler 
(KL)  measure  between  PDFs  and  use  a  minimax  approach.  We  will  see  that  it 
admits  a  simple  and  very  intuitive  solution  and  one  that  under  some  conditions 
coincides  with  the  EEF  for  model  order  selection.  The  KL  measure  (also  called 
the  divergence)  is  defined  as 

D(pT(t]0)\\pT(t-,n,C))  =  [ pr(t;fl)ln  dt.  (25) 

J  Pt{  t;pi,C) 

For  the  case  of  two  multivariate  Gaussian  PDFs  with  t  ~  J\f(pi1,Ci)  and 
t  ~  N(pi2i  C2)  it  is  shown  to  be 

DtV^T.COIIV^.Cj))  =  t  In  j^j  +  ltr  [C.CCp  -  Cr1)]+i(f.,-M2)TC2-1(M2-t*i) 

(26) 

The  next  theorem  shows  that  the  minimax  approximating  PDF  to  M{9,  er2(HTH)-1) 
is  given  as  M(0,  (1  +  £2/p)<72(HTH)~1).  It  is  interesting  to  note  that  the  same 
result  is  obtained  in  the  Bayesian  case  if  we  were  to  assume  Zellner’s  g-pnor 
for  Q  .  In  this  case  the  prior  PDF  for  G  is  A/XO,  (£2/p)c2(HTH)_1).  This 
is  undoubtedly  due  to  the  close  connection  between  minimax  and  Bayesian 
decision  theories. 

Theorem  1  (Minimax  approximation  to  PDF).  Assume  apx  1  random  vector 
T  is  distributed  according  to  Pt  =  A f(9,  Ct)  where  the  mean  0  is  unknown  but 
lies  within  a  constraint  set  0  =  {0  :  eTC tl9  <  £2},  and  the  covariance 
matrix  Ct  is  known.  We  approximate  the  PDF  of  T  in  a  minimax  sense  using 
Pt  =  W(/i,  C)  for  some  pi,  C,  which  is  equivalent  to  solving  the  problem 

minm&x  D(N(9.  Ct)\\Af(pi,  C))  (27) 

P  c  9ee 

The  solution  of  (27)  is  pi*  =  0,  C*  =  ^1  +  ^  Ct. 

Proof.  See  Appendix  A. 

Having  obtained  a  suitable  replacement  for  the  PDF  of  the  sufficient  statis¬ 
tic  of  9,  which  does  not  depend  on  9,  we  can  finally  obtain  the  approximating 
PDF  for  the  original  data  x.  From  (24)  we  now  have 

t  -  (1  +  £2/p)f72(HTH)~1) 

u  ~  W(0,  <j21N-p) 
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where  t  and  u  are  independent,  and  upon  transforming  back  to  x  via  x  = 

Ht  +  Bu  we  have  that  x  is  multivariate  Gaussian  with  a  zero  mean  vector  and 
covariance  matrix 

C  =  HCtHr  +  BC„BT 

=  H(l  +  <e2/p)a2(HTH)-1Hr  +  B(j2I^_pBT 
=  (1  +  $2/p)(j2Ph  +  <t2BBt. 

But  note  that  BBT  =  is  the  orthogonal  projection  operator  and 

so 

BBt  =  P£. 

The  covariance  matrix  becomes 

C  =  (l+^2/p)a2PH+a2Pj{  =  {l+£2/p)a2PH+a2(IN-PH)  =  a2IN+(£2 /p)a2PH 

and  is  seen  to  be  an  “inflated”  covariance.  With  the  above  analysis,  we  con¬ 
clude  this  section  with  the  following  theorem: 

Theorem  2  (Minimax  PDF  of  x  for  the  linear  model).  For  the  linear  model, 
the  minimax  PDF  of  x  is  J\f(0,a2l^  +  (f2  /p)er2P  H) ,  where  the  condition 

0  <  £2  determines  £2 . 

We  then  consider  a  signal  duration  estimation  problem  and  compare  the 
performance  of  the  MSD,  EEF  and  MDL.  Consider  a  model  estimation  prob¬ 
lem  with 


Mp  :  x[n] 


s [n]  +  w [n]  for  n  =  0, 1 , . . . ,  p  —  1 
w[n\  for  n  =  p, p  +  1 , . . . ,  iV  —  1 


where  s[n]’s  are  unknown  but  satisfy  f>2[n\/a'2  —  Vh2  =  h2p  with  known 

^2,  and  w[n]  is  the  white  Gaussian  noise  (WGN)  with  known  variance  a2. 
Here,  the  constraint  can  be  written  as  ]T)n=o  s2[n]/p  <  cr2^2,  which  is  a  power 
constraint  on  the  signal  and  makes  physical  sense.  Note  that  for  Mp,  the 
signal  length  is  p,  and  therefore,  this  is  a  signal  duration  estimation  problem. 
It  can  be  written  as  the  linear  model  with 


Hp  = 


0 


( N—p)xp 
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and  dp  =  [s[0]  s[l]  . . .  s\p  —  1]]T  with  constraint 

\\npep\\2 /a2  =  Y^s1[n\/a2  <  pf2 

n= 0 


The  MSD  criterion  can  be  simplified  as 

MSD(p)  =  ^  ^  “  2  ln  ^  +  ^ 

Let  the  number  of  samples  be  N  —  10,  the  largest  candidate  model  order  be 
M  =  3,  and  each  model  has  equal  prior  probabilities.  We  fix  a2f2  =  1  so 
that  the  size  of  the  constraint  balls  is  fixed.  We  plot  the  probability  of  correct 
selection  PG  versus  1/a2  in  Figure  3  (note  that  for  different  a2,  £2  is  different, 
since  cr2£2  is  fixed).  For  each  cr2,  the  experiment  is  repeated  50000  times.  For 
each  run,  sp  is  uniform  randomly  distributed  within  X^n=os2N  —  Pa2^2  =  P- 
The  EEF  and  the  MDL  are  also  simulated  for  comparison.  The  EEF  and  MDL 
rules  choose  the  model  order  that  maximizes  the  following  respectively: 

FFFfrti-f  21nLGP(x)-p[ln(21nIGp(x)/p)  +  l]  2  ln  LGp  (x)  >  p 
l  0  21nLGp(x)  <  p 

—  MDL(p)  =  2  ln  LGp (x)  —  pin  N 

for  p  =  1,2, ..  .  ,M,  where  21nLGp(x)  =  21n  ■  Here  0P 

is  the  MLE  for  6P  =  {s[0],  s[l], . . . ,  s\p  —  1]}.  We  can  see  in  Figure  3  that  the 
MSD  outperforms  the  EEF  and  MDL  for  all  SNR. 

4.2  An  Application  to  Classification 

To  use  the  MSD  for  this  classification  problem,  we  first  need  to  generalize 
Theorem  1  to  the  next  corollary. 

Corollary  1.  If  we  change  the  constraint  set  to  ©  =  {0  :  (6  —  9*)TCf1(9  — 
9*)  <  £2}  with  9*  known,  then  the  solution  of  the  minimax  problem 

minmax£>(AA(0,  Ct)\\J\f(pi,  C)) 

P-c  0ee 

\i'  =  o\  c*=(i  +  £)ct. 
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Figure  3:  Performance  of  the  MSD,  EEF,  and  MDL  for  estimation  of  signal 
duration  when  £2’s  are  known. 

Note  that  the  constraint  0  =  {9  :  (9  —  6*)TCJ1{0  —  6*)  <  £2}  also 
makes  practical  sense.  For  example,  in  radar  systems,  we  may  have  some 
prior  knowledge  that  the  possible  targets  are  within  a  certain  region. 

Now,  let  us  consider  a  classification  problem  where  we  have  two  models  (it 
easily  extends  to  multiple- model  case): 

Mi  :x  =  Si  +  w 
M2  :x  =  s2  +  w 

where  x  is  N  x  1,  w  ~  A/"(0,  a2I)  with  a2  known,  and  Si  and  s2  are  unknown 
but  with  some  partial  knowledge  that  they  satisfy  ||si  —  st||2/<r2  <  £2  and 
lls2  —  s2|| 2/cr2  <  £|.  st,  s2,  £2  and  £2  are  assumed  to  be  known.  Let  S\  =  {s!  : 
1 1 Si  —  st||2/cr2  <  £2}  and  S2  =  {s2  :  ||s2  —  S2 1 |2/cr2  <  ^,2}.  We  also  assume 
that  Si  and  S2  do  not  overlap.  It  follows  from  Corollary  1  that  the  minimax 
approximating  PDFs  of  A/’(s1,<72I)  and  M(s2,a2T)  are  Pi(x)  =  A/’(sJ,  (1  + 
£2/fV)<72I)  and  p2(x)  =  (1  +  £\/N)o2 1),  respectively.  As  a  result,  we 

decide  Mi  if 

lnpi(x)  >  In  p2(x) 
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or  equivalently  if 


-N]n(l  +  $/N) 


(x-st)T(x-st) 
(1  +ejN)a* 


(x  -  s^)T(x  -  s*2) 

(1  +e2/N)a* 


Note  that  if  £2  =  £2>  if  is  the  usual  ML  rule  for  s *  =  s*.  In  order  for  comparison, 
we  next  derive  the  pseudo-ML  rule  for  this  case.  The  pseudo-ML  rule  decides 
Ali  if 

max  p(x;  Si |  All)  >  maxp(x;  S2IA42) 

sieSi  s2es2 


or  equivalently 


!|x  —  si||  <  ||x  —  s2|| 


Note  that  minimizes  the  norm  ||x  — Sj||.  Therefore,  if  ||x  —  s*\\2/a2  <  £f  for 
i  —  1  or  2  (since  51  and  S2  do  not  overlap),  then  st  =  x,  ||x  —  Sj||  =  0,  and  the 
pseudo-ML  rule  decides  Ah  if  x  is  within  the  ith  ball.  If  ||x  ~  S*||2/cr2  >  £2 
for  i  =  1,2,  s,-  must  be  at  the  intersection  of  the  line  connecting  x  and  s*  and 
the  boundary  of  the  ball  Si  for  i  =  1,  2  as  illustrated  in  Figure  4.  Therefore, 


Figure  4:  Solution  of  s *  if  ||x  —  s*||2/a2  >  £2  (x  is  outside  the  ball)  for  i  =  1, 2. 
1 1 x  —  Si||  =  ||x  —  s* 1 1  —  <h<7  for  i  =  1,  2.  The  pseudo-ML  rule  decides  Afi  if 

llx  -  sill  -  fa  <  iix  -  S2I1  -  i-io 

Finally,  we  simulate  the  classification  problem  considered  in  this  subsection. 
Let  N  =  10,  st  -  [1  1  ...  if,  s$  =  [— 1  -  1  . . .  -  1]T,  Ci  =  0.8,  £2  =  1.2, 
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and  M  1  and  M2  have  equal  prior  probability.  We  plot  the  probability  of 
correct  classification  Pc  versus  a2.  For  each  a2,  the  experiment  is  repeated 
50000  times.  For  each  run,  Si  or  s2  are  uniform  randomly  distributed  within 
{si  :  | |si  —  s*||2/cr2  <  £2}  and  {s2  :  ||s2  — s^^/cr2  <  £f},  respectively.  In  Figure 
5,  we  can  see  that  the  MSD  outperforms  the  pseudo-ML  rule,  especially  when 
<72  is  large. 


Figure  5:  Classification  performance  of  the  MSD  and  the  pseudo-ML  rule. 


5  The  General  Pythagorean  Theorem  for  the 
EEF 

With  one  sensor  T\(x),  we  construct  the  EEF  as 

Pm (x)  =  exP  [mTiix)  -  Kw(r}i)  +  lnpo(x)] 

The  true  PDF  pt(x)  is  unknown,  but  the  moments  Et  ( Tj(x ))  =  A i  are  known. 
In  order  to  find  rft,  we  minimize  D  (pt\\pm)  =  Pi  Et  (Tl(x))  —  which  is 

equivalent  to  solving  Et  (Ti(x))  =  EVl  (Ti(x)).  Let  the  solution  be  ,  and 
it  satisfies 

Et  (Ti(x))  =  ^(l,.  (Ti(x))  =  Xi  (28) 

Next,  with  two  sensors  Ti(x)  and  T2(x),  the  EEF  is 

PmUW  =  exP  [PiTi(x)  +  P2 T2(x)  -  K(2)(^,  rj2)  +  lnpo(x)] 
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Let  the  solution  be  and  rj^* ,  and  similarly,  they  satisfy 


Et  (Ti(x))  =  £,(2)^(2).  (Ti(x))  -  Aj 

(29) 

£t  {T2{'x))  =  E  m.  (2).  (T2(x))  —  a2 

M2 

(30) 

Then  we  have  the  following  theorem: 

Theorem  3  (General  Pythagorean  theorem  for  the  EEF.). 

D{Pt\\po)  =  D  +  D  (?„<»>*,,,<»>•  IIp„<»*)  +  D  (p^n. | |p0)  (31) 

Proof.  See  Appendix  B. 


6  Sensor  Selection  for  a  Multipath  Scenario 

Assume  that  we  transmit  a  signal  s[n\  for  0  <  n  <  N  —  1.  Due  to  the  multipath 
propagation,  the  received  signal  is  modeled  as 


x[n }  =  s[n]  *  h[n\  +  w[n ] 


(32) 


where  h[n ]  is  the  impulse  response  of  the  multipath  model,  and  w[n]  ~  A/”(0,  a2) 
is  white  Gaussian  noise.  Here  we  assume  that  h[n]  is  nonzero  for  0  <  n  < 

M  —  1,  and  therefore,  we  collect  the  received  signal  x [n]  for  0  <  n  <  N+M— 2. 
r  t  T 


Let  s,  = 


0.  .^Q  s[0]  ...  s[N  —  1]  0  ...  0 


be  an  (N  +  M  —  1)  x  1  vector 


L  i  0’s  J 

for  i  =  0,  M  —  1.  Then  the  received  signal  model  in  (32)  can  be  written  as 


M— 1 

x  =  V  h[i]si  +  w  =  [  So  Si 

i=i  s 


s 


S/W-l 


h[0] 

Ml] 


[  h[M  -  1] 

' - V - " 

h 


+W 


Note  that  Sh  is  a  linear  combination  of  Sj’s. 

Assume  that  we  have  M  sensors  whose  outputs  are 

Ti(x)  =  sfx 
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for  i  —  0, 1, . . . ,  M  —  1.  We  also  assume  that  the  means  of  the  sensor  outputs 
are  known,  i.e. ,  E  [2~i(x)]  =  sfSh  is  known  for  i  —  0, 1, . . . ,  M  —  1. 

Procedure: 

In  step  1,  we  choose  the  sensor  with  the  smallest  KL  divergence  D(pt\\pv( u)  or 
equivalently  the  longest  projection  of  Sh  onto  the  subspace  generated  by  s*. 
Denote  this  sensor  as  T^{x)  and  corresponding  vector  as  s^b 

In  step  2,  we  choose  from  the  remaining  sensors  with  the  smallest  KL 
divergence  D(pt ||p  <2))  or  equivalently  the  longest  projection  of  Sh  onto  the 
subspace  generated  by  {s^bs,;}.  Denote  this  sensor  as  P(2,(x)  and  corre¬ 
sponding  vector  as  sl2b  The  selected  set  of  sensors  is  denoted  as  = 
{Tbl(x), T^(x)},  and  the  corresponding  set  of  vectors  is  denoted  as  = 
{s^bs^}. 

In  step  i,  we  choose  from  the  remaining  sensors  with  the  smallest  KL 
divergence  D(pt\\pv(i))  or  equivalently  the  longest  projection  of  Sh  onto  the 
subspace  generated  by  {S(’~rb  s, },  and  so  on. 

Simulation  Results: 

Let  s[n]  =  cos(27r (f0n  +  \kn2))  for  0  <  n  <  N  —  1  where  /0  =  0.1,  k  =  0.002, 
and  N  =  100.  Let  M  =  20  and  the  impulse  response  of  the  multipath  model  is 
plotted  in  Figure  6.  Note  that  h[ 3],  h[9],  and  h[14]  has  the  largest  magnitudes, 


Figure  6:  Impulse  response  of  the  multipath  model, 
and  therefore,  we  expect  to  select  S3,  S9,Si4  in  the  first  3  steps.  The  resulting 
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order  of  the  selected  sensors  is 

3, 9, 14, 1, 18, 0, 19, 16,  5, 7,  2, 12, 10, 4, 17, 15, 8, 6, 13, 11 

We  plot  the  KL  divergence  between  the  true  PDF  and  the  EEF  in  the  tth  step 
in  Figure  7.  We  can  see  that  the  KL  divergence  is  close  to  zero  at  step  10, 
which  implies  that  we  may  only  need  the  first  10  of  selected  sensors  to  have 
the  same  performance  as  using  all  sensors. 

800  —y - r  - - - 

700 

600  - 

500  • 

8  1 
§i  400 ' 

$ 

S  300  r  i  - 

* 

200  \ 

100-  \ 

-100 - 1 - 1 - L - 

0  5  10  15  20 

Step  i 


Figure  7:  KL  divergence  between  the  true  PDF  and  the  EEF  in  the  ith  step. 


7  Decision  Combination  with  Multiple  Learn¬ 
ing  Models/Hypotheses 

We  consider  the  combination  methods  that  are  based  on  estimates  of  posterior 
probabilities.  Consider  a  training  data  set  Dtr  with  m  instances,  which  can 
be  represented  as  {xq,  yq },  q  =  1, . . . ,  m,  where  xq  is  an  instance  in  the  n  di¬ 
mensional  feature  space  X ,  and  yq  eY  =  {1, . . . ,  C}  is  the  class  identity  label 
associated  with  xq.  Through  a  training  procedure,  such  as  bootstrap  sampling 
or  subspace  methods,  one  can  develop  L  classifiers,  hj,  j  =  1  There¬ 

fore,  for  each  testing  instance  xt  in  the  testing  data  set  Dte ,  each  classifier  can 
vote  an  estimate  of  the  posterior  probability  across  all  the  possible  class  la¬ 
bels,  Pj(Yi\xt),  j  =  1, . . . ,  L  and  Yi  =  1, . . . ,  C.  Based  on  the  Bayesian  theory, 


21 

Approved  for  public  release;  distribution  unlimited. 


given  the  measurements  Pj(Yi\xt),  where  j  =  1, . . . , L  and  Yj  =  1 ,C,  the 
testing  instance  xt  is  assigned  to  Yj  provided  that  the  posterior  probability  is 
maximum.  The  Bayesian  decision  rule  illustrates  that  it  is  critical  to  compute 
the  probabilities  of  various  classifiers  with  the  consideration  of  all  measure¬ 
ments  simultaneously  in  order  to  fully  utilize  all  available  information  to  reach 
a  prediction.  The  most  commonly  adopted  combining  rules  in  the  literature 
include  geometric  average  rule  (GA-rule),  arithmetic  average  rule  (AA-rule), 
median  value  rule  (MV-rule),  majority  voting  rule  (MajV-rule),  Borda  count 
rule  (BC-rule),  max  and  min  rule,  weighted  average  rule  (Weighted  AA-rule) 
and  weighted  majority  voting  rule  (Weighted  MajV-rule).  The  objective  of 
our  research  on  this  is  to  find  a  combining  rule  for  an  improved  estimation 
of  the  final  posterior  probability,  P(Yj \xt),  based  on  the  individual  Pj(Yi\xt) 
from  each  classifier  hj. 

In  this  work,  we  consider  a  general  ensemble  learning  scenario  as  illustrated 
in  Fig.  8.  We  consider  the  ensemble  system  including  L  hypotheses  (i.e., 
classifiers  in  our  current  work),  each  associated  with  a  signal  strength  Sj  as  a 
criterion  related  to  the  posterior  probability  Pj(Yi\xt).  We  also  introduce  a 
related  concept,  uncertainty  degree  nj,  j  =  1, . . . ,  L.  For  instance,  in  a  two- 
class  classification  problem,  Pj  =  0.5  represents  the  lowest  certainty,  meaning 
that  out  of  the  two  classes  each  one  is  equally  likely.  On  the  other  hand,  Pj  =  0 
or  Pj  —  1  represents  full  certainty,  meaning  that  the  hypothesis  is  certain  about 
the  class  identity  label.  In  multiclass  classification  problems,  given  a  class  label 
Yj,  the  predicted  label  yt  of  any  testing  instance  Xt  can  be  represented  as  a 
Boolean  type:  yt  =  Yj  or  yt  e  Y j,  where  Y t  =  {Yk,k  ^  i}.  In  this  way,  the 
multiclass  classification  problem  can  also  be  transformed  analogous  to  a  two- 
class  problem.  To  this  end,  the  signal  strength  can  be  represented  as  \Pj  —  0.5|, 
whereas  the  uncertainty  degree  is  0.5  —  \Pj  —  0.5 j. 

The  signal  strength  Sj  and  the  uncertainty  degree  rij  can  be  used  to  repre¬ 
sent  the  knowledge  level  of  the  hypothesis  j.  In  our  approach,  higher  weights 
are  assigned  to  the  classifiers  that  have  higher  signal  strengths  and  lower  uncer¬ 
tainty  degrees,  i.e.,  are  more  certain  on  their  decisions,  whereas  lower  weights 
are  assigned  to  those  classifiers  that  have  lower  signal  strengths  and  higher 
uncertainty  degrees,  i.e.,  are  less  certain  on  their  decisions.  To  do  so,  the 
weights  uij  should  be  proportional  to  the  signal  strength  to  uncertainty  degree 
ratio  3j  as 

uij  oc  /3j  =  — ,  (33) 

no¬ 
where  Sj  —  | Pj  —  0.5 1,  and  n3  =  0.5  —  sj.  This  provides  the  foundation  of  our 
proposed  SSC  approach. 
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Figure  8:  An  ensemble  learning  system  with  multiple  classifiers. 


Fig.  9  gives  an  example  of  an  ensemble  system  with  four  classifiers  for  dif¬ 
ferent  combining  rules.  Assume  that  we  have  obtained  the  weight  coefficients 
(for  weighted  AA  rule  and  weighted  MajV  rule)  and  the  decision  profile  for  the 
testing  example  xt ■  From  Fig.  9  one  can  see,  in  this  particular  example,  the 
MajV-rule  and  weighted  MajV-rule  vote  this  testing  instance,  xt,  as  a  class  2 
label.  For  the  MV-rule,  because  the  votes  for  class  1  and  2  are  the  same,  the 
final  predicted  label  can  be  randomly  selected  from  these  two  classes.  For  the 
BC  rule,  the  final  predicted  label  can  be  randomly  selected  from  classes  1,  2, 
and  3.  All  other  methods  vote  this  testing  instance  as  a  class  1  label. 

To  analyze  the  characteristics  of  our  approach,  we  adopted  the  margin 
analysis  to  investigate  the  classifier  decision-making  process. 

Definition  1:  Consider  a  classification  problem,  the  classification  margin 
on  an  instance  is  the  difference  between  the  weight  assigned  to  the  correct 
label  and  the  maximal  weight  assigned  to  any  single  incorrect  label,  i.e.,  for 
an  instance  {x,y}  , 

margin(x)  =  wh{x)=y  -  max{^(a.)^}  (34) 

Definition  2:  Given  a  data  distribution  D ,  the  margin  distribution  graph 
is  defined  as  the  fraction  of  instances  whose  margin  is  at  most  A  as  a  function 
of  A  e  [-1,1]: 

(35) 
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Figure  9:  Exemplary  comparison  of  the  SSC  and  other  combining  methods. 

where  D\  =  {a;  :  margin(x) 

• 

tC 

VI 

stands  for  the 

size  operation  and 

F( A)  e  [0, 1]. 

Based  on  this,  we 

conducted  the  margin  analysis  for  the  SSC  method  in 

comparison  with  several  other  method.  Fig.  10  shows  an  example  of  the  margin 
distribution  graph  for  the  proposed  method  with  respect  to  the  AA-rule  and 
MV-rule.  It  is  clearly  shown  that  the  proposed  SSC  method  can  achieve  a  high 
margin  in  this  case.  For  instance,  as  illustrated  by  the  dash-dotted  line,  for 
the  proposed  method,  there  are  72.06%  of  the  testing  data  with  margin  less 
than  0.5,  whereas  for  the  MV-rule  and  AA-rule,  there  are  97.47%  and  99.64% 
of  the  testing  data  with  a  margin  less  than  0.5,  respectively. 
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Figure  10:  Margin  distribution  graph  analysis. 


8  PDF  Estimation  over  Data  sStream  by  Se¬ 
quences  of  Self-Organizing  Maps  (SOM) 

The  main  idea  of  the  work  is  to  build  a  series  of  SOMs  over  the  data  streams  via 
two  operations,  i.e.,  creating  and  merging  the  SOM  sequences.  The  creation 
phase  produces  SOM  sequence  entries  for  windows  of  the  data,  which  obtains 
clustering  information  of  the  incoming  data  streams.  The  size  of  the  SOM 
sequences  can  be  further  reduced  by  combining  the  consecutive  entries  in  the 
sequence  based  on  the  measure  of  Kullback-Leibler  divergence.  Finally,  the 
probability  density  functions  (pdfs)  over  arbitrary  time  periods  along  the  data 
streams  can  be  estimated  using  such  SOM  sequences.  We  have  developed  the 
system-level  architecture,  learning  and  estimation  algorithm,  and  simulated 
the  approach  in  Matlab.  We  have  also  compared  our  approach  with  two  other 
KDE  methods  for  data  streams,  the  M-kernel  approach  and  the  cluster  kernel 
approach,  in  terms  of  accuracy  and  processing  time  for  various  stationary  data 
streams,  including  non-stationary  data  streams. 

The  system  level  architecture  is  illustrated  in  the  Fig.  11. 

Given  a  d-dimensional  data,  aq  =  (x-,xf, . . . ,  xf)  €  Xj  C  5Rd,  each  neuron 
rii  in  the  SOM  is  associated  with  a  d-dimensional  feature  vector  or  weight, 
a >i  =  . . .  ,uf).  SOM  learning  include  three  states: 

Competition  stage:  determining  the  winning  neuron  as  the  neuron  nw(t) 
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Figure  11:  SOMKE:  system  architecture  of  the  SOM  based  KDE  method. 

with  the  smallest  distance  (maximum  similarity)  to  the  input  vector 

ujw(t)  =  argmin  ||a;t -^11 ,  i  =  (36) 

U >i 

Cooperation  stage:  a  topological  neighborhood  function  around  the  win¬ 
ning  neuron.  Example:  Gaussian  function  h,(t): 

Kit)  =  exP i  =  (37) 

where  d?  is  the  squared  distance  on  the  grid  of  nodes  between  nw  and  nu 
and  a(t)  is  the  effective  width  of  the  topological  neighborhood  with  the  initial 
value  ctq. 

Adaptation  stage:  Weight-updating  rule: 

ujiit  +  1)  =  u>i(t)  +  r)(t)hi(t)(xt  -  uj.it))  (38) 

where  rj(t)  is  the  learning  rate: 

v(t)  =  bo  exp(  — . -),  t  =  0,1,2,...  (39) 

~2 

where  r2  is  a  time  constant. 
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Once  the  SOM  sequences  are  developed  along  the  stream  data,  one  can 
merge  them  based  on  estimated  KL  divergence  (minimal  KL  divergence): 

Dkl(P\\Q)=  [  P(x)hg^dx.  (40) 

Jytd  Q\x) 

Fig.  12  illustrates  an  example  of  the  combination  of  two  SOMs,  i.e., 
SOM j  and  SOMj+i  combine  to  SOMc. 


SOU  :r  =  f/,;  +  |) 


SOM  :r  =[/./)  SOM,.,  r.  =[>  +  !, >-l) 


Figure  12:  An  example  of  combination  of  two  SOM  sequence  entries. 

Two  merging  strategies  are  developed: 

-  Fixed  Memory  Strategy:  Fix  the  total  amount  of  memory  allocated  to 
store  the  SOM  sequence; 

-  Fixed  Threshold  Strategy:  Minimize  the  overall  KL  divergence  of  all 
consecutive  pairs  of  entries  in  the  SOM  sequence  based  on  a  threshold  a. 

9  EEF  for  Gaussian  Distribution  Classifica¬ 
tion 

Based  on  the  EEF  theory  that  the  team  developed,  we  also  analyzed  and  de¬ 
veloped  specific  classification  techniques  based  on  the  EEF  during  this  project 
period.  For  the  constructed  joint  PDF  Pr{t,rj),  we  define  l(rj)  as  the  log- 
likelihood  function 

Kv)  =  lnpT(t,  77)  (41) 
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Given  a  set  of  training  data,  we  can  estimate  the  natural  parameter  vector  77 
with  MLE  algorithm 


77  =  argrnax  l{rf) 


(42) 


Then,  we  can  write  the  constructed  joint  PDF  of  measurements  under  Hi  as 


PT{t,f]i)  =  exp  [<  fii,t  >  -K0(f}i)  +  \npT(t,Ho)}  (43) 

Since  the  constructed  joint  PDF  is  parameterized  by  the  natural  parameter 
vector  77  in  EEF,  we  consider  the  px(t,  1li)  equal  to  the  estimation  of  pT(t,  Hi) 
under  Hi. 

Given  unknown  testing  data  ts  to  be  classified,  similar  to  the  MAP  estima¬ 
tor,  we  assign  the  class  Hi  to  t^  if  +  In  £>(77^  is  maximized  over  i,  where 
pirfi)  which  equals  to  p(Hi)  describes  the  prior  probability  of  M  candidate 
hypotheses.  Under  the  assumption  of  equal  prior  probability  of  M  candidate 
hypotheses,  such  that 

mi)  =  p{n2)  =  ■  •  •  =  p(Hm)  =  4  (44) 

the  target  function  of  our  classifier  rule  is  built,  written  as 

>-*««,)  <45> 

Thus,  in  the  training  process  of  our  new  classifier  rule,  the  natural  param¬ 
eters  of  constructed  joint  PDF  are  firstly  estimated  with  MLE  from  the  sets 
of  training  data  under  each  hypothesis.  Then,  the  constructed  joint  PDF  for 
each  hypothesis  is  available  with  the  estimated  natural  parameter  vectors  77, 
In  the  testing  process,  the  unknown  testing  data  can  be  classified  according  to 
the  target  function  built  in  Eq.  (45),  which  is  very  similar  to  the  MAP  rule. 


9.1  Hypothesis 

We  analyze  and  apply  our  EEF  based  method  to  Gaussian  process  classifica¬ 
tion.  Compared  to  the  classic  expectation  maximum  (EM)  rule  for  classifica¬ 
tion,  our  new  classification  rule  constructs  one  new  joint  PDF  px(t,  77)  with 
exponential  form  based  on  EEF  and  builds  new  target  function  for  classifica¬ 
tion. 

Considering  the  following  hypotheses  for  Gaussian  process  classification 
with  unknown  expected  mean  vector  and  unknown  covariance  matrix: 

H0  :  x  =  s0  +  w0  (46) 
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where  the  Ho  is  the  reference  hypothesis  and  wo  denotes  Gaussian  noise  i.e. 
w0  ~  A/"(0,  £0).  The  data  So  is  named  as  reference  data  which  satisfies 
So  ~  M(0, 1)  where  I  is  identity  matrix.  The  meaning  of  s0  is  shown  later 
in  this  section.  The  joint  PDFs  of  M  candidate  hypotheses  satisfy  Gaussian 
models  i.e.  we  have  s,  ~  A £,),  i  =  1, 2,  •  ■  •  ,  M  where  both  and  S;  are 
unknown.  The  vector  of  x  denoted  as  [:ri, . . . ,  xp}T  is  one  data  with  p  features 
or  attributes.  So,  s,  and  Wo  are  all  p  x  1  vectors. 

In  our  approach,  the  new  joint  PDF  with  EEF  is  constructed  based  on 
the  joint  PDF  under  reference  hypothesis.  Here,  we  firstly  assume  that  the 
covariance  matrix  of  noise  So  is  reasonable  measured  and  then  the  joint  PDF 
of  noise  could  be  known.  To  simplify  our  derivation,  we  also  assume  that  the 
Gaussian  process  is  one  i.i.d.  process,  and  that  all  the  feature  variables  are 
conditional  independent  with  each  other.  By  introducing  the  reference  data  s0, 
we  have  H,  =  Ho  by  choosing  fii  =  0  and  Sj  =  I.  Hence,  all  distributions  of 
hypotheses  could  be  embedded  into  one  exponential  family  distribution,  that 
is  the  underlying  idea  of  EEF. 


9.2  Joint  PDF  Construction  with  EEF 

For  multivariate  Gaussian  process  distribution,  since  the  mean  vector  and  co- 
variance  matrix  are  unknown,  there  is  one  pair  of  minimal  sufficient  statistic 
which  are  the  estimation  of  mean  vector  and  covariance  matrix  with  MLE.  Un¬ 
der  the  assumption  that  all  the  feature  variables  are  conditional  independent 
with  each  other,  the  pair  of  minimal  sufficient  statistic  is  written  as: 


T  =  t(x)  = 


where  x  and  v  are  the  mean  vector  x*,  =  ^Z^Xj/iV  and  the  diagonal  vector 
of  covariance  matrix  £  =  5Zi=1(x,-  —  x)(xj  —  x)T/N  respectively.  Also  both  x 
and  v  are  the  p  x  1  vectors,  i.e.  x  =  [ofi,  tr2,  ■  ■  •  ,  xp]T  and  v  =  [sf ,  s|,  •  •  •  ,  Sp]T. 
N  denotes  the  number  of  samples  in  the  Gaussian  process.  Since  all  the  feature 
variables  x  =  [x\,  x2,  ■  ■  ■  ,  xp]T  satisfy  Gaussian  distribution,  it  is  easily  shown 
that  x,  v  are  independent  and  their  joint  PDFs  for  Ho  are 

x~  Af(0,^)  (49) 
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which  satisfies  Gaussian  distribution,  and 


\p(N- 1)/2  p 


p(v)  =  M  Tm^TYtT  n 


r{W 


k= 1 


Sk 


N- 3 


exp 


2  ^<7? 
fc=l  K 


(50) 


respectively,  where  Eq  =  Eq  + 1  and  cr^  is  the  k- th  element  of  diagonal  matrix 


■'o- 


Thus,  according  to  the  Eq.  (43),  the  constructed  joint  PDF  with  EEF  for 
Gaussian  process  is  stated  as: 


Pt( t,  i?)  =  exp  [<  r7,  t  >  —K0(rf)  +  lnpo(t)]  (51) 

where  rj  =  [?7f,r7^]r  in  which  both  rj1  and  t]2  are  p  x  1  vectors,  and  t(x)  is 
shown  in  Eq.  (48).  Also,  it  is  shown  in  Appendix  that 


p  „  2  2  /V  _  o  P 

Ko{n)  =  ^  +  A0  -  —j—  ln  (n  ~  2 <rlv2,k)  (52) 

fc=i  fc=i 

where  Ao  is  constant  term. 

Therefore,  the  natural  parameter  vector  rji  under  each  hypothesis  Hui  = 
1, 2,  ■  •  ■  ,M  can  be  estimated.  We  have 


9Kq (rhAh) 
drji 

dKojri^rti) 
drf  2 


(53) 


and 


^  _  N  N  —  3 

k  -  ~  2.»f 


(54) 


Given  one  testing  process  Xi,  •  •  •  ,x^  to  be  classified,  We  decide  Hi  for  which 
the  following  is  maximum  over  i: 


where  the  constant  term  is  omitted. 
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9.3  Simulation  for  Gaussian  Process  Classification 


In  this  part,  we  compare  the  classification  performance  of  our  rule  with  Pseudo- 
MAP  for  Gaussian  process  classification  problem.  Besides  that,  both  of  results 
are  also  compared  with  the  MAP  rule  in  which  the  true  parameters  are  as¬ 
sumed  to  be  known.  For  the  models  shown  in  Eq.  (46)  and  Eq.  (47),  the 
Pseudo-MAP  method  estimates  the  source  parameters  (xt  as  /k,  and  E,  as  E, 
with  MLE  algorithm  under  each  hypothesis  Hi,  i  =  1, 2,  •  •  •  ,  M  from  training 
data,  and  assumes  px(t,  Hi,  ^n)  *s  equal  to  the  estimation  of  px(t,  %i).  Hence, 
in  the  Pseudo-MAP  method,  the  testing  data  x  is  assigned  the  class  label  when 
the  following  target  function  is  maximum  over  i: 


lnpx(x,AJ,Et)  =  -^(x-  Ai)TSi  !(x-  fc) 
-  -  In  jdet  ^27rEj^ 


(56) 


We  choose  Eq.  (56)  and  Eq.  (55)  as  the  target  functions  for  Pseudo-MAP 
rule  and  our  classifier  rule  respectively.  There  are  M  —  3  classes  with  p  =  5 
attributes,  and  Ns  =  1000  training  processes  for  each  class.  Each  process  has 
N  =  25  samples.  Let  st  ~  A/"(/q,  E *)  and  u  ~  A/"(0,  <r2I),  we  have 


Hi  =  [5,  5, 5, 5,  5]  Ej  =  201 

H2  =  [5,  6, 6, 5,  5]  S2  =  211 

p3  =  [5,5,5,6,6]  E3  =  221  (57) 

where  a2  is  known,  but  the  source  parameters  p*,  Ej,  i  =  1, 2,  •  •  •  ,  M  in  Gaus¬ 
sian  distributions  are  all  unknown.  The  probabilities  of  correct  classification 
(Poc)  of  both  rules  versus  cr2  are  shown  in  Fig.  13.  Monte  Carlo  method  is  used 
in  this  simulation  in  which  each  result  is  averaged  over  2000  experiments  for 
every  a2.  It  is  shown  that  the  classification  performance  of  our  classifier  rule 
is  very  close  to  the  performance  of  MAP  rule.  Since  the  construction  of  joint 
PDF  EEF  in  our  classifier  rule  is  based  on  the  joint  PDF  under  the  reference 
hypothesis  and  needn’t  the  same  information  of  training  data  as  MAP  rule, 
we  evaluate  the  influence  of  insufficient  training  data  for  both  rules.  Fig.  14 
shows  the  simulation  results,  in  which  we  compare  the  classification  perfor¬ 
mances  of  our  classifier  rule  and  the  MAP  rule  versus  the  number  of  training 
data.  It  states  that  our  rule  has  much  better  classification  performance  than 
the  MAP  rule  as  the  number  of  training  data  decreased.  That  means,  for  the 
case  of  insufficient  training  data,  our  classifier  rule  reduces  significantly  the 
severity  of  decreased  classification  performance. 
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Figure  13:  The  probability  of  correct  classification  for  our  rule  based  on  EEF 
and  MAP  rules 
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Figure  14:  The  impact  of  decreasing  training  data  for  our  rule  and  MAP  rules 
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A  Proof  of  Theorem  1 

In  this  Appendix,  we  prove  that  /r*  =  0,  C*  —  ^1  +  C*  is  the  solution  of 
the  following  minimax  problem: 

min  max  D(J\f(9,  C*)||JV(/a,  C)) 

W  9e@ 

where  the  px  1  mean  9  is  unknown  but  lies  within  a  constraint  set  ©  =  {9  : 
9tC;19  <  £2},  and  the  covariance  matrix  Ct  is  known. 

Proof.  From  (26),  we  have 

D(V(8, C,)||V(m, C))  =  \ln  |Q[+^r  [Cl(C"‘  -  cr')]+^(9-f )TC-\t-n) 

(58) 
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Assuming  that  C  and  Ct  are  positive  definite  so  that  the  eigenvalues  XfiC)  >  0 
and  Aj(Ct)  >  0  for  i  =  1, 2, . . . ,  p.  It  is  well  known  that  there  exists  a  full  rank 
p  x  p  matrix  V  such  that 

VTCtV  =  I  (59) 

VTCV  =  A  (60) 


where  A  is  a  diagonal  matrix  with  positive  elements,  and  V  depends  on  C 
(since  Ct  is  known)  but  not  /x.  Now  since  C  =  (VT)_1AV_1,  it  follows  that 
C_1  =  VA_1Vt  and  therefore, 

D(Af(0,  COIM/x,  C))  =  In  |C*|  +  ^  In  |(VT)-1AV-1| 

+  ^r[Ct(VA-1VT-C-1)] 

+  ~2(e-^)TVA-lvT(e-n)  (6i) 

Note  that  it  follows  from  (59)  that 


Ct  =  (V^^V"1  (62) 

In  |Cf|  =  In  |(VT)~1V~1|  (63) 

and 

-lln|C,|  +  lln|(VT)-1AV-1| 

=  -|ln  |C,I  +  i  ln|.(VTr‘V~1.l  +  i  In  |A| 

Ct 

=  ^ln|A| 

Also, 

tr(CtVA-]VT)  =  tr(VTCtV  A-1)  =  trf  A'1)  (64) 

^  . 1  ^ 

I 

Now  let  O'  =  VT0;  pd  =  VT/z  Note  that  O'  and  n'  depend  on  C  since  V 
depends  on  C.  77je  AL  divergence  can  be  written  as 


1.  ...  1.  P  1 


i5(Af(9,Ct)||Af(M,C))  =  i!n|A|  +  iMA-1)  -  jj  +  i(@'  -  -  A) 


1  ^  1  1  p  1  (0-  -  Pt)2 

i=l 


1=1 


(65) 


i=l 
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where  \’s  are  the  diagonal  elements  of  A. 

Note  that  C  enters  into  the  KL  divergence  via  A,  ’s  and  implicitly  in  O'  and 
fj,'.  For  a  fixed  C,  however,  we  can  optimize  over  O’  and  pd .  To  simplify  the 
optimization,  let  a*  =  1/A j  >  0  for  i  =  1,2 , ,p.  The  KL  divergence  is 


D(Ff(0,Ct)\\J\f(n,C))  =  J{0'  ,p!,a) 

—b\  B* 

2= 1 


In  a*)  +  -  a'(9'i  ~  F'i?  (66) 


i= 1 


Now  we  determine 

min  max  J{0',  n!  a)  (67) 

a,/X'  e' 

subject  to  0TCf^0  <  £2.  Since  Cfl  =  VVr,  in  the  O'  space  the  equivalent 
constraint  is  0^0'  <  £2 . 

First  consider  maximizing 


gi(O')  =  ^2ai(0'i-  p-)2  (68) 

2=1 

over  O' .  But 

{01-  T'i?  <m  +  \T'i\)2  (69) 

with  equality  holds  for  9[  =  —sgnfpff) \6[\  where  sgn(-)  is  the  sign  function. 
Therefore,  we  can  equivalently  maximize 


ju(ie'l)  =  j>(i«5i  +  Mi)s 


i=l 


over  |0/| .  For  fixed  a,  note  that 


min  max  >  <^(10(1  +  Ip'Af 

v  \0'\  jri 

>  maxmin^aid^l  +  \p'i\)2 
I®  I  '  i= i 

p 

—  max  aiOf 

o'  o,T  o'  <e  h 

p 

=  max  >  aS'f 

0':0'T0'=?i^ 

—  ^maxs 


(70) 


(71) 
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where  araax  =  max{ai,a2, . . .  ,ap}.  Equality  holds  if  and  only  if  pi'  =  0, 
O'  =  [0  •  •  •  0  £  0  •  •  •  0]T  where  the  position  of  the  only  non-zero  element 
f  corresponds  to  that  of  amax.  As  a  result,  for  fixed  a, 


min  max  J(6' ,  u!  ,a) 
t1'  O' 


V-  + 

2  2 


1  p 

9& 


j=i 


In  af)  +  -an 


(72) 


It  only  remains  to  minimize  /(a)  =  (°i  —  In  ai)+amaxf2  over  a  with  a,-  >  0 

for  all  i. 

First  note  that  a*  —  In  a*  is  a  strictly  convex  function  in  a*  and  hence 
~  Inaj)  is  strictly  convex  in  a.  Also  amax  =  max{a1;  a?,  ■  ■  ■ ,  ap} 
is  convex.  Thus,  /(a)  is  strictly  convex  and  a  symmetric  function  of  the 
ai ’s.  If  a  minimum  exists,  it  is  unique.  Hence,  the  minimum  must  be  at 
a i  —  a2  =  •••  =  ap,  since  otherwise  any  permutation  will  produce  the  same 
value,  violating  uniqueness.  Letting  a  =  a\  =  =  ■  •  •  =  ap,  the  objective 

function  is 

/(a)  =  p(a  —  In  a)  +  a£2  (73) 

Taking  the  derivative  with  respect  to  a  and  setting  it  to  zero,  a  satisfies 

p--+?= o 

a 

producing  a*  =  i+jjwu-  From  (72),  we  have 


max  J {O' ,  pi',  a)  = 

a,//'  Q’  Z  Z 


\+f2/p 


+  In(l  +f2/p)  )  + 


2 1 + e/p 


fln(l  +e/p) 
(74) 


which  is  the  minimax  KL  divergence  between  M{0,  C*)  and  Af(pi,  C). 

Finally,  we  need  to  find  the  minimax  solution  of  pi  and  C.  First,  note  that 


=  (VT)-y*  =  0  (75) 

Also,  A*  =  1/a*  —  1  +  f2/p  for  all  i  and  thus  A*  =  (1  +  f2/p)  I.  It  follows  that 

C*  =  (Vr)_1  A*V_1 

=  (VT)_1(1  +^2/p)IV_1 

=  (l+£2/p)  (V7)-^-1 

V - v - / 

c  t 

=  (l  +  e/p)Ct  (76) 
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B  Proof  of  Theorem  3 

Proof.  We  have  that 


D  (ptllp^a).^).)  =  Et  (lnpt(x)  -  77f)*T1(x)  +  rj2)*T2(x)  -  K{2){r]\2)*,  rj{2)*)  +  lnpo(x)  ) 
=  D  (ft||«,)  -  Ek  (^2)*Ti(x)  +  ^2)*T2(x)  - 
=  ^  (ftllPo)  -  +  r^2)*A2  -  (77) 

follows  from  (29)  and  (30)  that 

D  (p„m-,4».||p,o)-)  =  £(»..,«.  (>)!2,Ti(x)  +  ^2|T2(x)  ~  A'(2)(r,<2)-,7j<2)-)) 

-  Er-.r-  (in)'T^x'>  -  ^“W*)) 

=  I)!2)*A,  +  >)?%  -  Ml  ")  -  Ini' ’‘A,  -  K<D(,(')-)‘ 

(78) 


From  (28),  we  have  that 

D  (p„(«.||Po)  =  £„<>,.  (tf’T.M  -  ^’tf’-))  =  ^A!  -  is-(1)(i)i1>‘)  (79) 

Summing  (77),  (78),  and  (79)  results  in  the  general  Pythagorean  theorem  in 
(31). 
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